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ABSTRACT 

Mobile genetic elements (MGEs) account for a sig- 
nificant fraction of eukaryotic genomes and are impl- 
icated in altered gene expression and disease. We 
present an efficient computational protocol for MGE 
insertion site analysis. ELAN, the suite of tools 
described here uses standard techniques to identify 
different MGEs and their distribution on the genome. 
One component, DNASCANNER analyses known in- 
sertion sites of MGEs for the presence of signals 
that are based on a combination of local physical 
and chemical properties. ISF (insertion site finder) 
is a machine-learning tool that incorporates infor- 
mation derived from DNASCANNER. ISF permits 
classification of a given DNA sequence as a poten- 
tial insertion site or not, using a support vector 
machine. We have studied the genomes of Homo 
sapiens, Mus musculus, Drosophila melanogaster 
and Entamoeba histolytica via a protocol whereby 
DNASCANNER is used to identify a common set of 
statistically important signals flanking the insertion 
sites in the various genomes. These are used in ISF 
for insertion site prediction, and the current accur- 
acy of the tool is over 65%. We find similar signals at 
gene boundaries and splice sites. Together, these 
data are suggestive of a common insertion mechan- 
ism that operates in a variety of eukaryotes. 

INTRODUCTION 

A prime challenge in current genomic studies is the accur- 
ate annotation of genomes with regard to locating genes, 
identifying promoters, CpG islands, repeats and indeed all 
significant features that could have a biological conse- 
quence. There are a number of computational approaches 
to this long-standing problem, many of which have been 
translated into software tools. Well-known examples include 



the whole-genome annotation protocols Ensemble (1), 
MaGe (2), GenDBan (3) and RiceGAAS (4) for the com- 
plete annotation of both prokaryotic and eukaryotic gen- 
omes. For the identification of well-characterized entities 
such as protein coding genes, a number of programs (5-7) 
are highly successful for both eukaryotic and prokaryotic 
genomes. 

Precisely what constitutes a significant feature on the 
genome can be difficult to define, especially when the cor- 
responding function is not well understood. Mobile 
genetic elements (MGEs) fall in this category. They are 
clearly important, constituting a major proportion of 
most eukaryotic genomes, and it is presumed that they 
are responsible for the significant expansion in genome 
size (8), while their role in regulation is still being un- 
covered (9). Originally such elements were considered to 
be parasitic (10) but recent studies have shown MGE to be 
involved in gene inactivation (11), transduction (12) and 
regulation (9). Their role in human genetic diseases (13) is 
also becoming apparent. A major class of MGE arises 
from a retrotransposition mechanism (14) and it is pre- 
sumed that these have played a major role in genome evo- 
lution since they are more numerous in higher eukaryotes 
in comparison to more primitive organisms. 

The problem we address in this article is the identifica- 
tion of potential insertion sites for MGEs. Clearly this de- 
pends both on the nature of the element and on the 
composition of the genome and the approach that we 
take here is integrative, using existing tools for the identi- 
fication of elements in conjunction with bioinformatics 
analysis of the local environment of each class of element. 
This then makes it possible to discover new insertion sites 
as well as to analyze element behavior across genomes. 

We briefly discuss the context of the problem. A number 
of methods have been developed over the years for the 
identification of transposable elements (TEs), and these 
have been summarized by Bergman and Quesneville (15). 
This article also drew attention to the fact that while existing 
methods — that use tools that include homology, com- 
parative genomics as well as de novo techniques — can be 
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combined into fairly successful TE detectors, the discovery 
and annotation of new TEs remains a challenge in 
bioinformatics. 

Mobile elements face relatively low-selection pressure, 
and thus evolve quite rapidly. This has two consequences. 
First, MGEs are often difficult to detect due to the fact 
that there is a high-nucleotide sequence divergence among 
them. Most copies of these elements are truncated at either 
or both of the 3' and 5' ends. This can be a problem in 
studying older elements whose classification sometimes 
requires considerable subjectivity. 

A second consequence of low selection pressure relates 
to the local environment of the element, namely the inser- 
tion site itself. While these are also evolving rapidly, what 
we see at the present time is the set of TEs that have 
managed to stay on in a given set of sites, and not neces- 
sarily all the sites where TEs attempted insertion. In other 
words, the locations where we see MGEs today are those 
that were good insertion sites and have been good 'reten- 
tion' sites. How the insertion sites have evolved, and 
whether their evolution is related to their capacity for 
retention of elements are important questions, but lie 
somewhat outside the scope of the present work. 

The distribution of retrotransposons themselves is quite 
variable, with EhLINEs/EhSINEs occupying <20% of 
the Entamoeba genome (16,17), while LINEs and SINEs 
comprise ~50% of the human genome. Further, their 
internal architecture and the mechanism of their insertion 
and proliferation within the genome is quite distinct from 
that of protein-coding sequences. Additional complexity 
arises from the fact that each genome and its respective 
elements present a unique case in terms of data manage- 
ment and analysis. Sequencing methodologies and data 
formats differ, ranging from unassembled contig or scaf- 
fold data in the case of Entamoeba histolytica, to com- 
pletely assembled genomes such as the 23 chromosomes 
of the human genome. The heterogeneity of sequences and 
the huge amount of data spread across a wide variety of 
useful genome databases such as Ensembl (http://www 
.ensembl.org), FlyBase (http://www.fiybase.org) and 
Genbank (http://www.ncbi.nlm.nih.gov/) present another 
bioinformatics challenge for interspecies analysis. 

In this article, we describe a computational protocol, 
ELAN website, http://nldsps.jnu.ac.in/elan.htnil, which 
is targeted towards genome-wide retro-transposon 
element analysis. Given an element that has been identi- 
fied either by some existing protocol (15) or by the internal 
module ELEFINDER that is provided within ELAN, 
different components permit the following analyses: 

(i) Genome-wide distribution profiles of elements cur- 
rently present on the genome (to any desired level 
of homology). This is performed by the module 
ELEFINDER. 

(ii) Extraction of consensus insertion sites and 
their subsequent analysis in terms of a number of 
physical and chemical properties. Retro-transposon 
insertion sites have distinctive structural features 
that derive from their specific composition. 
Physical properties which are pertinent in this 
context include DNA bendability (18,19), and 



propeller twist (20) as well as thermodynamic 
features such as stacking energy (21), duplex stabil- 
ity (22,23) and denaturation energy (24). The 
program DNASCANNER analyses insertion 
hotspots of elements in detail and provides a set 
of signals or characteristics that are potentially 
recognized by an element for its insertion. 

(iii) Prediction of potential insertion sites via ISF, an 
Insertion Site Finder tool that employs machine- 
learning techniques that use the results from the 
application of DNASCANNER. 

(iv) Comparative genomics of MGEs via a relational 
database, InSiDe (Insertion Site Database) which 
keeps a growing list of MGEs, their locations and 
their insertion environment. 

The software and the data generated from the present 
study can be accessed online at the site http://nldsps.jnu 
.ac. in/elan. html. 

MATERIALS AND METHODS 

All DNA sequences continuously undergo mutations, in- 
sertions and deletions. In this scenario, determining what 
constitutes an 'element' can be a computational challenge. 
For our purpose, an 'element' is a partial or complete 
DNA sequence that has many of the characteristics of a 
given family of TEs such as a reverse transcriptase domain, 
endonuclease domain, poly A tail and intact boundaries 
determined through target site duplications (TSDs), etc. 
The element itself is generally constructed as a consensus 
of several copies retrieved through database search and 
subsequent processing is needed to construct a master or 
ancestor gene (25). A 'fragment', namely a sub sequence of 
an element occurs when reverse transcription is incomplete 
or by indel events, post-insertional mutations and so on. 
We divide fragments into three classes based upon their 
termini: 5' truncated-3' intact; 5' intact-3' truncated; and 
5' truncated-3' truncated (Supplementary Figure SI). 

ELAN 

ELAN is a series of programs that run sequentially, each 
component attempting to address a question of biological 
relevance. As shown in Figure 1, the pipeline is divided 
into three major parts. First, copies of a given TE are 
located in the genome via a BLAST N (7) search, using 
Perl/Bioperl (26) scripts to parse output files. Elements as 
well as flanking sequences are extracted and processed, 
mainly to identify and merge fragments. Additional 
programs find truncation hotspots, the distances of TEs 
to genes as well as to other elements. Redundancies are 
removed by a standard protocol. 

A number of genomes such as Homo sapiens, Mus 
musculus and E. histolytica and insertion elements listed 
in Repbase (27) have been included in the present study 
and information is kept current through continuous 
updation on the site http://nldsps.jnu.ac.in/elan.html. 
ELAN has been benchmarked against RepeatMasker 
and sample files as well as programs are given for com- 
parison purposes in the Supplementary Data. 
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Figure 1. Schematic view of the various programs in the ELAN pipeline. 



DNA SCANNER 

DNA SCANNER scans genomic DNA for a number 
of different physicochemical properties by incorporating 
biophysical, thermodynamic, protein interactions and 
sequence based features. The algorithm is outlined in 
Figure 2. Based on a choice of input parameters, the pro- 
gram evaluates a number of properties in moving windows 
along the length of the query DNA sequence. Substrings 
of window size w are generated from the 5'-end of input 
DNA sequences, and further divided into words (Di/Tri 
nucleotides). For each of the properties (see below) par- 
ameter values 7* are derived and an average score S p is 
computed as a function of position to generate a graph. 

Structural signals: DNA bendability 

DNA bendability is the ability of DNA to deform under a 
specific stimulus such as protein binding. Several models 
have been proposed to study relationship of sequence with 
structural bendability: there are both a trinucleotide model 
based on DNAse 1 cutting frequencies (18), and a di- 
nucleotide model based on X ray crystallography of 
DNA oligomers, and kinetoplast DNA (28) in gel migra- 
tion studies. It has been observed that highly bendable 
DNA contains tracts of A with 'loose periodicity' (29). 
The trinucleotide model (18) is based on the observations 
that certain enzymes such as DNAase I preferably bind 
and cuts DNA that is bent (or bendable) towards the 
major groove, and thus DNAse I cutting frequencies on 
naked DNA can be taken as quantitative measures of 
major groove compressibility and anisotropic bendability. 

We have used this parameter earlier for studying the 
nature of pre-insertion loci (30) where the TPRT mechan- 
ism suggested that restriction endonuclease involved in 
retrotransposition appeared to require bent DNA for 



binding and nicking. Apart from this, specific bendability 
has been used as a feature to recognize E. coli promoters 
(31). 

Thermodynamic signals: stacking energy 

The relationship between sequence, structure and stacking 
energy has important biological consequences (32), for 
example, for sequence-specific interactions with proteins 
(33). Stacking energies are indicators of stability of the 
given DNA sequence as well as of protein interactions, 
and thus play an indirect role in formation of local struc- 
ture (34,35). These include contributions from electrostat- 
ic, polarization, dispersion and repulsive forces. The total 
interaction energy of each stacked base pair is the sum of 
intra and inter-strand stacking energies (21). The stacking 
energies as a function of the rotational angle and separ- 
ation distance between complementary pairs of all 16 di- 
nucleotides have been used in DNA SCANNER. 

Duplex stability: free energy signals 

The relative stability of DNA duplex structure depends 
upon its base sequence (23), and more specifically upon 
ten different types of nearest neighbor interactions namely 
AA/TT; AT/TA; CA/GT; GT/CA; CT/GA; GA/CT; CG/ 
GC; GC/CG; GG/CC. Using this information the overall 
stability (as a measure of AG) and melting behavior of a 
sequence can be predicted. 

Propeller twist signals, bending stiffness and 
nucleosomal positioning 

DNA must distort in order to bend around a protein: this 
distortion is facilitated by the deformational capacity of 
dinucleotides. Some are practically rigid whereas others 



Nucleic Acids Research, 2011, Vol. 39, No. 16 6867 



Sequence(s) 



1 






Parameter Selection (P) 






Window Size (w) 






Start Position (x) 






1 




Loading of 






parameter file 






rpp 

j 






i 




Substring generation 






(m) of each 




sequence (s) of size w al 




position x 





Word Formation 
(Di/Tri-nucleotides) 
of all the substrings at position x 



Average Score 
at position x (S x ) 



Redo till sequence end 



Graphical plot for 
Parameter wrt position 
using GNUPLOT 



Redo for all 
selected parameters (P) 



Figure 2. Flow chart depicting the sequence of procedures followed by DNA SCANNER to generate profiles for given DNA sequences. The score as 
function of position (x) is computed as S x = Yl'f=\ Tf ■ T? 
file summed over the m substrings generated for a window of size w. 



Hf=\Tf ■ Tf is defined as the parametric score of substring j (di/trinucleotide) derived from parameter 



are flexible, and the propeller twist parameter measures the 
tendency of DNA to twist about the long axis that makes 
the two bases of a pair non-coplanar (20). Dinucleotides 
having a large propeller twist tend to be more rigid than 
dinucleotides with low propeller twist; thus propeller twist 
can also be used as a measure of DNA flexibility. 

The conformational and mechanical properties of the 
DNA double helix vary in a sequence-dependent manner 
(36). Specifically, nucleosomal rotational setting and bend- 
ability is a function of underlying nucleotide sequence 
which contributes heavily on placement of nucleosome 
at specific position. Nucleosomal DNA is highly bent: his- 
tories prefer sequences which have the potential to bend 
for the formation of nucleosomal particles. Sivolob and 
Kharpunov (36) have shown that DNA sequences possess- 
ing low-bending energy correlate with potential bending 
and nucleosomal positioning. 

Protein interaction signals 

The DNA sequence carries signals specific for its potential 
to deform when interacting with other molecules such as 
proteins and also during important biochemical processes 
such as transcription, replication and retro-transposition. 
This deformability can also serve as potential long range 
signals for molecular recognition and conformational rec- 
ognition (37), and based on information extracted from 
existing protein-DNA crystal complexes, empirical energy 
functions have been deduced. This gives the potential de- 
formability of a given sequence and its potential to 
interact with proteins. 



DNASCANNER has been developed in Perl/CGI-Perl 
and is modular in nature so that new properties can be 
included without changing the core code. The parametric 
data pertaining to the various physicochemical properties 
discussed above is stored in a separate flat file database; 
new rules can therefore be added or existing rules may be 
modified easily. The program is available online at http:// 
tinyurl.com/dnascanner. 

ISF 

The information generated by DNASCANNER from 
positive and negative datasets is used by ISF to construct 
set of rules to classify and 'predict' insertion sites in 
genome. 

Training and testing 

Flanking sequences extracted from known insertion sites 
constitute pre-insertion loci which we take as the positive 
dataset, Class P. By shuffling these sequences we obtain a 
negative dataset Class N a . This set of sequences maintains 
the base composition, but should not provide any suitable 
insertion sites. An independent negative dataset, Class Nb 
was constructed from randomly picked sequences in the 
vicinity of true insertion sites, and a third independent 
negative dataset, class N c is constructed from sequences 
taken out of coding regions in the genome. Profiles for 
each property for both positive and negative datasets 
were generated as described above and averaged within 
each class. The position of the extremum (maximum or 
minimum) was noted for each parameter for the positive 
dataset (Figure 3) and was considered significant if it 
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exceeded 2 SD from the mean background (Table 1). The 
profile was also computed for the negative datasets and 
only properties that showed a strong differential between 
positive and negative sets were included in the model. 

The nature and positions of an extremum differed in 
each of the chromosomes. Parametric files for each prop- 
erty were constructed and the maximum (or minimum) 
value and position were noted for each significant property 
(Table 1). This information was used as a quantitative 
measure to differentiate between positive and negative 
examples for a given chromosome (see Results section as 
well as Supplementary Data, tutorial.html). 
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Figure 3. Adenine density upstream of insertion sites of Alu in human 
chromosome 2. The ;'-axis represents the value of the property under 
study (here this is the 'A Rule') and the x-axis represents the position 
with respect to insertion site (taken as position 0). In all cases, the 
properties we examine have been computed for both positive and 
negative datasets. 



We base our scoring protocol to model insertion sites 
(P) or non-insertion sites (TV) on Bayes' rule (38). Using 
the thermodynamic, biophysical and chemical properties 
discussed above the posterior probability or score of an 
insertion site, P(Pj\Sj) is computed. This is the probability 
of classifying a DNA string as an insertion site given 
that property S, is true for this site. P(P) and P(N) are 
the prior probabilities of sites obtained from classes P and 
N, respectively; here the two classes are taken to be equi- 
probable. P(S,/P) denotes the conditional probability 
of the property being from class P calculated from 
training data. P(S,\N) can be computed in a similar 
manner, and the highest posterior is then taken as the 
true prediction. 

Insertion sites are characterized by diverse parameters. 
The sensitivity and specificity based upon single property 
scores as determined by Bayes' rule were only ~45-55%. 
We therefore decided to use a combination of param- 
eters in machine learning algorithms such as voting (39), 
Adaboost (40) and support vector machines (SVMs) (41). 
Extensive testing revealed that SVM performed better 
than other methods and emerged as the technique of 
choice in the present problem. A standard SVM was 
used to construct models for insertion sites of different 
elements, using binary scores (1 or 0). 

The details of the procedure are as follows. Consider, as 
an example, the problem of finding Alu in Human 
Chromosome 22 (HC22). The training set Z = (X, Y) 
consist of both positive and negative examples, each of 
which is characterized by a (i-dimensional vector. Set 
X contains insertion site examples (labeled +1) denoted 
by P and non-insertion site examples (labeled 0) as N. In 
order to implement SVMs, each insertion site was con- 
verted into feature vectors as described. The training set 
contained 595 insertion site and the same number of nega- 
tive examples. Different kernels were used for learning and 



Table 1. Information derived from DNASCANNER analysis of Human chromosome 2 (HC2) for Alu and LI elements 



Element Human Chrosome 2 



LINES (LI) SINES (Alu) 



Number of sequences taken 
Property 






209 








5334 




Position 


Trend 


Value 


Mean (SD) 


Position 


Trend 


Value 


Mean (SD) 


Trule 


-28 


u* 


0.347 


0.29 (0.03) 










Arule 


-14 


U 


0.530 


0.36 (0.05) 


-13 


V 


0.503 


0.342 (0.05) 


Grule 










-15 


D* 


0.140 


0.194 (0.04) 


Crule 


-14 


D 


0.082 


0.17 (0.03) 


-13 


D 


0.102 


0.173 (0.02) 


Atrule 


-14 


U 


0.769 


0.65 (0.05) 


-14 


U 


0.755 


0.632 (0.05) 


Bendability 


-12 


D 


-0.023 


-0.012 (0.003) 


-12 


D 


-0.02 


-0.01 (0.00) 


Nucleosomal_positioning 


-13 


D 


-3.887 


-1.48 (0.91) 


-14 


D 


-3.64 


-1.14 (1.00) 


b-a trimeric 


-11 


U 


0.285 


0.247 (0.01) 


-12 


U 


0.281 


0.244 (0.01) 


DNA denaturation 


-15 


D 


36.774 


39.21 (1.06) 


-14 


D 


37.03 


39.55 (1.10) 


Duplexstability 


-14 


u* 


-0.659 


-0.72 (0.03) 


-14 


U 


-0.66 


-0.73 (0.03) 


Propellartwist 


-13 


D 


-7.454 


-6.87 (0.23) 


-13 


D 


-7.39 


-6.79 (0.24) 


Stabilizing energy 


-12 


U 


1.827 


1.72 (0.04) 


-12 


U 


1.796 


1.704 (0.04) 


Stacking energy 


-17 


u 


-3.313 


-3.62 (0.12) 


-14 


U 


-3.33 


-3.65 (0.13) 


Bending stiffness 


-14 


D 


22.253 


26.88 (2.11) 


-14 


D 


22.95 


27.44 (0.09) 


Protein-induced deformability 


-12 


D 


1.948 


2.18 (0.09) 


-12 


D 


1.949 


2.186 (0.09) 



The Trend column refers to whether this is a maxima (V) or minima (D). 

*denotes that value of a given property (extrema) lie between mean +/- 2 Standard Deviation. 
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tested on an independent dataset containing the same 
number of negative and positive instances and compared 
with SVM-LIGHT (42). 

Training, testing and optimization were done for each 
element and each chromosome separately. Parameters 
were obtained for each feature, element and genomic 
sequence (organized in a hierarchical manner as 
catalogued at http://nldsps.jnu.ac.in/elan.html). For 
instance, for a given rule (say the A rule) the following 
important attributes were identified: 

(i) The extremum position (1) i.e. —13 bp (Table 1). 

(ii) The extremum value i.e. 0.503 (50% A density). 

(iii) The nature of the extremum, namely is it a 
minimum or maximum. 

(iv) The presence of additional peak in the flanks 
(±5 bp). 

These values are derived from training dataset for each 
element on a given chromosome and optimized using 
ROC curves via the sensitivity and specificity values 
(Supplementary Figure S2). 

Consider a curated set of n examples of known insertion 
sites and non-insertion sites and d features (or rules) on 
which to base this classification. Associated with each in- 
sertion site i k is a rf-dimensional vector. The SVM algo- 
rithm was trained using this dataset and the resultant 
model was tested with an independent dataset comprising 
both positive and negative examples. Given training sets 
there is an optimal hyperplane which separates the 
training data into two classes and this can be determined 
via standard methods (Supplementary Figure S3). 
Training and testing examples files are given in the 
Supplementary Data and on http://nldsps.jnu.ac.in/ 
pages/isf.html. 

Performance of the different rules 

Sensitivity is the degree to which true examples are cor- 
rectly detected whereas specificity is defined as the extent 
to which false instances are rejected successfully. The 
average of sensitivity and specificity is a measure of the 
overall accuracy of a specific combination of parameters 
used in ISF. 

The performance of ISF for predicting insertion sites is 
evaluated chromosome-wise and the accuracy of SVMs 
for Alu elements in the human genome is given in 
Supplementary Table SI. As can be seen, this can be 
quite high, reaching 73% accuracy for chromosome 22. 
Even higher accuracy is achieved in E. histolytica, 80% 
or more for some elements. Comparable results were 
obtained with elements in other genomes and LI in 
H. sapiens. In the initial stage of application of ISF, 
cutoff values for each parameter was calculated to be 
highest local maxima or minima observed at specific 
position in plots for each rule. These cutoffs were 
optimized by maximizing the overall accuracy as shown 
in Supplementary Figure S2. 

It should be pointed out that individual rules did not 
perform as well as their combinations did (Supplementary 
Table S2) but as discussed above, this did not depend on 
which SVM kernels were used. The accuracy remained at 



the 66% level for any given rule, but combination rules 
worked better. A worked example is given online at the 
ELAN website, using the Alu element in Chromosome 22 
of the human genome in the form of tutorial/screen shots 
along with a sample dataset (see Supplementary Data, 
tutorial.html). 



RESULTS 

In order to assess the applicability of the tool in different 
contexts, ELAN has been applied to over 50 genomes, 
including E. histolytica, Caenorhabditis elegans, Canis 
lupus familiaris, Macaca mulata, M. musculus and 
H. sapiens. Results obtained from these studies are avail- 
able through the InSiDe database, as well as on the web 
site http://nldsps.jnu.ac.in/elan.html. Two examples are 
discussed here in detail, an analysis of Alu and LI in the 
human genome, both well studied MGEs. 

Our analysis detects ~1.2 million copies (Supplementary 
Table S3) of Alus in the human genome, similar to previ- 
ously reported values (43). The chromosome-wise distri- 
bution of Alu copies is shown in Supplementary Table S3, 
the number of Alus is roughly proportional to the 
chromosome length (Figure 4). There are few functional 
Alu copies in the human genome, most being truncated at 
either the 5' or 3' end (or both). Their distributions on the 
various chromosomes follow similar patterns. 

As discussed in the 'Materials and Methods' section, we 
construct pre-insertion loci and classify them into the four 
groups (Supplementary Table S3): Intact on both ends, 
Intact on 5', Intact on 3' and Intact on neither end. 
These are then analyzed for different physicochemical 
properties using DNASCANNER (30). 

Alu elements tend to preferentially insert in the vicinity 
of A-rich regions (Figure 3), where thermodynamic, struc- 
tural and nucleosomal positioning parameters are also 
markedly different as compared to genomic average 
values. These deviations were found to be statistically sig- 
nificant: the Mann Whitney P-values were typically below 
0.05 (see 'Materials and Methods 1 section). These patterns 
are observed in all chromosomes, with the largest devi- 
ation being between —12 and —15 bp from the insertion 
site (Figure 5). Signals are mainly seen in the 5' intact class 
(with 3' either intact or truncated) while 5' truncated 
elements show signals only for some of the parameters, 
the location and level being inconsistent for different 
chromosomes (data not shown). 

Each of the signals is converted into a graphical profile 
that typically shows the relevant quantity peaking in the 
vicinity of the insertion site. To validate that a given profile 
is significant, we compute the average and standard devi- 
ation for each parameter profile and for each class of Alu. 
The local extremum, that is, maximum or minimum is 
considered significant only if it exceeds the mean ± 2 SD 
(Table 1). An important indicator of the existence of a 
pattern is that it should not occur where Alus do not 
insert, and in our work we construct specific negative 
datasets for this purpose. Further validation is provided 
by randomly removing one third of the positive dataset 
repeatedly. The pattern persists, suggesting that the signals 
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Figure 4. Distribution of Alu element across human genome. Four classes of elements (see 'Materials and Methods' section) are indicated with four 
different colors. The j'-axis represents the frequency of elements found on the different chromosomes (marked along the .r-axis). 





Figure 5. Various signals upstream of the insertion sites of Alu in chromosome 2. The y axis represents value of the property and the x-axis gives the 
relative position with respect to the insertion site (taken to be 0) (Figure 3). 
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are common to all members of the set. Similar signals are 
observed in flanking regions of Alu present in genes 
(Supplementary Figure S4). 

The biological relevance of signals in the vicinity of in- 
sertion sites derives from the fact that some of these are 
structural in nature, and some pertain to the energetics. 
Those that we employ here include: 

(i) DNA bendability which measures the mechanical 
flexibility around insertion sites. We find a region 
of low bendability (from —12 to —15 bp) followed 
by sharp increase in the flexibility upstream of Alu 
insertion sites (Supplementary Figure 5A). 

(ii) Propeller twist, which also measures the flexibility 
of DNA via its tendency to twist about the long 
axis that makes the two bases of a pair non- 
coplanar (20). Dinucleotides having a large propel- 
ler twist tend to be more rigid than those with 
low-propeller twist. Since proteins encoded during 
TPRT seem to distort DNA, sites with specific di- 
nucleotides that are more rigid and are followed by 
a lower propeller twist appear to be amenable for 
insertion (Figure 5). 

(iii) Bending stiffness and nucleosomal positioning: nu- 
cleosomes play an important role in DNA com- 
pacting as well as in providing transcription 
factors access to regulatory regions. Since this is es- 
sential for activation of gene expression (44), we ex- 
amined two different nucleosomal related features, 
the bending energy/persistence length (36) and the 
nucleosomal positioning profiles (36). We find that 
insertion sites have a low-energy region be- 
tween positions —31 and —11 with significant min- 
imum at position —14 (Supplementary Figure S5). 
Similar results were obtained using bending-energy 
profile (Supplementary Figure S5B). 

(iv) Stacking energy profiles show a peak near suitable 
insertion sites, indicating that high energy regions 
are intrinsically unstable, leading to easy 
de-stacking or melting of DNA sequence that 
enables Alu insertion (Supplementary Figure S5C). 

(v) Duplex stability is a measure of the relative stabil- 
ity of the DNA-duplex structure and is dependent 
on sequence (23), and more specifically upon 10 
different types of nearest neighbor interactions 
namely AA/TT; AT/TA; CA/GT; GT/CA; CT/ 
GA; GA/CT; CG/GC; GC/CG; GG/CC. Using 
this, we find that the region around the —13 
position is destabilized more easily compared to 
controls (Supplementary Figure S5D) in suitable 
insertion sites. 

(vi) The DNA-denaturation-energy profile 
(Supplementary Figure S5E) indicates that a rela- 
tively small amount of energy is required to melt 
DNA near insertion sites favoring retrotran- 
sposition (37 Kcal/mol at —13 bp). 

(vii) Protein interaction signals, assessed via the deform- 
ability of the DNA show that a region of low de- 
formability, followed by one of high deformability 
(Figure 5) facilitates retrotransposon insertion. 



About 100000 copies of LI are found uniformly 
distributed in each chromosome of the human genome. 
Most copies (73%) are truncated on both ends, 21 852 
(20%) are truncated on the 5' -end and 1938 copies 
(1.8%) are truncated on the 3'-end. Thus intact copies of 
LI number 3663 (fewer than 4%) indicative of the extreme 
mutability of these elements in the face of very weak 
selection pressure (Table 2). The pre-insertion loci of LI 
also share similar signals with Alu loci (Supplementary 
Figure S6). 

Analysis of E. histolytica genome 

In E. histolytica, an early branching unicellular eukaryote 
that occupies a unique evolutionary position, MGEs 
occupy only 11% of genome. The pre-insertion loci of 
EhLINEl, EhLINE2, EhSINEl and EhSINE2 were con- 
structed as described above. All elements show similar 
insertion preferences, for example, T richness at the site 
of insertion in contrast to A-rich regions in the human 
genome (for both Alu and LI elements). A number of 
signals such as the propeller twist, stacking energy, 
bendabilty profile, free-energy profile, DNA-denaturation 
energy, protein-induced deformabilty and nucleosomal- 
related features are present in the 5' upstream region. 
They are absent in negative datasets constructed specific- 
ally for the E. histolytica genome, namely scrambled posi- 
tive dataset, sites picked from vicinity of pre-insertion loci, 
sites picked from prokaryotic genomes with comparable 
AT richness, and sites picked from within genes where 
elements are known to not insert. 



Table 2. LI element distribution on the different chromosomes on 
the human genome 



Chromosome 


Total 


Truncated 


5' Truncated 


5' Intact 


Intact 


number 


elements 


on both 


and 


and 


on both 






ends 


3' intact 


3' truncated 


ends 


1 


7377 


5432 


1557 


166 


222 


2 


8451 


6239 


1771 


155 


286 


3 


7766 


5665 


1666 


154 


2S1 


4 


8336 


6005 


1819 


170 


342 


5 


7434 


5355 


1650 


138 


291 


6 


6641 


4849 


1441 


126 


225 


7 


5406 


4045 


1064 


103 


194 


8 


4153 


2996 


886 


86 


185 


9 


4288 


3188 


890 


68 


142 


10 


4238 


3125 


891 


75 


147 


11 


4971 


3575 


1110 


96 


190 


12 


4468 


3238 


963 


84 


183 


13 


3658 


2735 


785 


49 


89 


14 


3188 


2335 


684 


50 


119 


15 


2565 


1953 


474 


42 


96 


16 


1654 


1241 


321 


33 


59 


17 


1417 


1104 


263 


20 


30 


18 


2544 


1926 


516 


39 


63 


19 


1005 


787 


154 


25 


39 


20 


1440 


1080 


296 


14 


50 


21 


1180 


943 


209 


7 


21 


22 


631 


526 


82 


7 


16 


X 


10221 


7751 


1981 


173 


316 


Y 


1509 


995 


379 


58 


77 


Total 


104 541 


77088 


21852 


1938 


3663 
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Insertion sites of MGE in other genomes also show similar 
signals 

We analyze genomes of other organisms for the presence 
of signals in regions upstream of their respective insertion 
elements. 

In M. musculus genome, preinsertion loci of the Bl 
element were investigated (Supplementary Figure S7) and 
it was observed that almost all the physicochemical param- 
eters discussed above show significant extrema between 
— 12- and — 15-bp upstream of the insertion site of intact 
(or full length) elements. Alu insertion sites in M. mulata 
(Supplementary Figure S8) also have same characteristics 
(see Supplementary Data, Master_Supplementary_file.doc). 

Application of DNASCANNER and ELAN on the human 
genome 

Some human diseases are linked to genes known to be 
susceptible to retrotransposon insertions (45). A plausible 
hypothesis is that within the coding region of such genes 



are signals that may be recognized as insertion sites. We 
applied DNASCANNER to a number of such genes. In 
order to investigate the optimal insertion environment 
within the coding region, a number of representative se- 
quences were selected. Known sites of insertion and a 
flanking region of length 1000 bp were extracted from gen- 
ome databases. Controls were selected in the same gene in 
the vicinity of insertion site as well as randomly picked 
regions from the same gene. We conducted detailed 
analyses of genes reported to be susceptible to TE inser- 
tion leading to disease (46,47), including DMD (48), 
CYBB (49) and APC (50) for LI, Alu and SVA insertions. 
Details of the identified site and orientation of insertion 
are given in Supplementary Table final-disease-gene-table 
.xls. Pre-insertion loci were constructed by removing the 
TE as well as the TSD. In most cases, significant signals 
were observed in the flanks of disease genes although these 
are not seen at Alu and LI insertion sites (see Figure 6 
and Supplementary Figures at http://nldsps.jnu.ac.in/TE- 
Vs-diseases/database/). These were classified as typical, 




Figure 6. (A) DNA-denaturation profile of pre-insertion loci of ABCD1 gene reported to be disrupted by Alu element at position 0. (B) Bendability 
profile of preinsertion loci of Dystrophin gene reported to be disrupted by LI element at position 0. (C) DNA-denaturation profile of pre-insertion 
loci of Spectrin gene reported to be disrupted by SVA element at position 0. (D) DNA-denaturation profile of pre-insertion loci of APC gene 
reported to be disrupted by Alu element at position 0. In this example the 1000 bp of sequence flanking insertion site of LI element. 
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atypical or negative: typical signals were those similar to 
Alu and LI insertion sites, a single statistically significant 
extremum 10-20-bp upstream the insertion site. Examples 
included gene such as SERPING1. (Supplementary 
Data at final-disease-gene-table.xls) The atypical had sig- 
nals outside from the —10- to — 20-bp region or had two 
extrema. The negative class did not show any significant 
signal. Of the 49 genes studied, 1 1 genes were typical, in 
this sense 27 genes were atypical and 1 1 showed no signals 
(see http://nldsps.jnu.ac.in/elan.html). A representative 
atypical case is that of DMD, one of the largest genes 
known, of length 2.4 Mb. An instance of LI insertion 
has been reported to cause disruption in the gene at 
position 128 269 leading to Duchene Myotrophy disease 
(51,52). Flanking regions ranging from 100 bp to 5 Kb 
were analyzed which revealed the presence of insertion 
signals in the vicinity. This and other examples can be 
seen online at http://nldsps.jnu.ac.in/TE-Vs-diseases/ 
database/ as well as in Supplementary Data (final- 
disease-gene-table.xls). Based upon our limited study of 
49 genes, we assume that the absence of typical signals 
within exonic regions prevents wide-spread disruption 
of coding regions by Alu and Lis. Several copies of 
LI and Alu are already present in the intronic regions of 
these genes; this suggests that our understanding of the 
role of these elements in causation of diseases by gene 
disruption is at present incomplete and further analysis 
is necessary 

To investigate the role of insertion sites or signals in 
gene regulation, we applied DNA SCANNER to a 
promoter dataset derived from eukaryotic promoter 
database (EPD) (53). It appears that structural features 
of sequences within 500 bp (upstream or downstream) of 
the TSS of several eukaryotic genes also have characteris- 
tic signals (see Supplementary Data at the program web 
site) although there are significant differences in the par- 
ameters in the upstream and downstream regions. Signal 
extrema are mostly seen in vicinity of TSS in a variety of 
examples from Xenopus laevis, Gallus gallus, M. musculus, 
Rattus norvegicus, Bos taurus, as well as several plants 
promoters (Figure 7). Similar signals have also been seen 
at splice sites of Drosophila (Supplementary Figure S9) as 
well as other organisms. 

Element detection within ELAN 

In the present work, we use the module ELEFINDER 
within ELAN to detect copies of known MGEs such as 
Alu or LI. This was benchmarked against the standard, 
RepeatMasker (A.F.A. Smit et al., unpublished data, 
www.repeatmasker.org) and also compared to other 
existing tools. 

Our algorithm detects 99.5% of the Alu insertions in 
human chromosome 22 (HC22) that are located by 
Repeatmasker run with the following parameters: 
RepeatMasker -alu -s -species human -no_is hs-chr22.fa. 
(see Supplementary Data RM-ELAN-Ch22.xls at web 
site). ELAN detects 24 135 Alu copies, whereas 
Repeatmasker detects 24248; the difference of 113 
(0.5%) could be due to merging step of ELAN. A third 
program CENSOR (54) uses the Smith-Waterman 



nucleotide alignment to detect repeats and outputs 
masked genomic DNA along with a tabular summary of 
TE content. The online version of CENSOR (http://www. 
girinst.org/censor/index.php) with standard parameters 
found 24237 copies of Alu, slightly more than either 
Repeatmasker or ELAN. A full report is given in the 
Supplementary Data at the web site. 

ISF 

The graphical results obtained from DNA SCANNER 
were made quantitative in the following manner. For 
each of the different structural or energetic features that 
were assessed in the vicinity of insertion sites, the location 
of the extremum (in base pair), its nature [minimum (D) or 
maximum (If)], the value at the extremum, the average 
and standard deviation (in order to assess the back- 
ground). An example is described in Table 1 which sum- 
marizes results for Alu and LI in human chromosome 2 
(HC2). 

Consider the A density. This has a maximum 
(U = 0.503) at 13-bp upstream of Alu insertion sites, 
with average and standard deviation of the overall 
profile being 0.34 and 0.05, respectively. The peak is 
thus significant and is included in the training. Any par- 
ameter whose values were judged to be non-significant was 
excluded from final classification model. A PERL script 
was used for automation of this procedure (http://nldsps. 
jnu.ac.in/dna2isf.html or tutorial.html). All Alu-insertion 
sites whose A-density profiles show a maximum at — 1 3 bp 
were considered to be positive for A rule, and the overall 
sensitivity, namely the fraction of insertion sites detected 
by ISF was found to be 0.64. This quantity was computed 
for each property (see Supplementary Table S4). 



Sensitivity 



TP 



TP+FN 



The specificity is the fraction of non-insertion sites rejected 
by ISF for a given rule, and as above, the results are 
summarized in Supplementary Table S4 for Alu insertion 
in human chromosome 2. For most properties these indi- 
cators lie between 0.6 and 0.7 indicating that individually 
they provide some level of discrimination between real 
insertion sites and sites that are not suitable for insertion. 

Cutoff values for each parameter were calculated as 
the highest extremum observed at specific positions. 
Sensitivity and specificity at several values for a given par- 
ameter (Supplementary Figure S2) were computed and 
optimized (Supplementary Table S5): this removes the im- 
balance between sensitivity and specificity values and 
improves overall performance. After optimizing each par- 
ameter, these new cutoff values are used for classification. 
The dataset was divided equally into training and test sets 
(Supplemental Data, trainalu.txt and testalu.txt) and for 
each a property vector was constructed, basically as a col- 
umn of l's and 0's, each entry corresponding to whether a 
each property scored above the threshold value (1) or did 
not (0) (see Supplementary Data, tutorial.html). 

These vectors were then used as training data for a SVM 
in the standard manner; see file trainalu.txt. Testing was 
done on an independent dataset (file testalu.txt) 
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Figure 7. (A) The DNA-denaturation profile of DNA sequences from EPD comprising B. taunts promoters (—500 bp) and genes (400 bp). The +1 
represent TSS of the gene. The window size was 100 bp of total length. (B) Propeller twist profile of same dataset. (C) Propeller twist profile of 
promoters of Xenopus. (D) DNA-denaturation profile of viral genes. 



comprising of an equivalent number of positive and nega- 
tive examples. We evaluated each of the parameters for a 
given chromosome separately to assess the performance 
of ISF and results for classifiers for each of the human 
chromosomes for Alu insertion sites are given in 
Supplementary Table SI. The highest accuracy was ob- 
tained from polynomial kernel in the SVM reaching 
90% for human chromosome X (Supplementary Table 
SI), and over 80% in E. histolytica. Comparable results 
were obtained with other elements and other genomes as 
well. 

We also investigated the effect of individual parameters 
and kernels on the accuracy of the SVM and the data is 
given in Supplementary Table S2. Although the individual 
rules performed poorly when used alone, the overall per- 
formance increased significantly when used in combin- 
ation. Rules were then deleted systematically from the 
set to examine the effect on the performance under each 
kernel and to arrive at a minimal set of rules that gives 
maximal predictive accuracy (Supplementary Data). 



DISCUSSION 

MGEs have been termed drivers of evolution (55,56) in 
that they effectively expand the genome by insertion, 
choose optimal insertion locations, create longer inter- 
genic and non-coding sequences — possibly also introns — 
and thereby promote genomic diversity. Intergenic se- 
quences presumably have fewer mutational constraints 
than coding sequences, and the lack of selection pressure 
appears to offer more versatility in evolution. 

The identification of locations on the genome where 
such elements can successfully insert is thus important, 
and in this article, we describe a set of bioinformatics 
tools for the analysis and prediction of putative MGE 
insertion sites. Such locations are an important factor 
for spreading of MGEs as well as genome expansion in 
general, and the tool that we have developed here, ELAN 
uses a combination of methods ranging from sequence 
comparison to pattern recognition, machine learning, 
and classification. The identification of elements, their 
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distribution and their relationship with specific genes can 
be analyzed in detail, and intergenomic comparisons are 
possible as well. 

During insertion, a non-LTR element causes the target 
site to distort in a number of ways and requires the co- 
operative action of a number of proteins such as reverse 
transcriptase and endonuclease to break bonds, unwind 
the DNA, and most importantly, to nick the target site 
strand (14). Insertion sites therefore show significant struc- 
tural patterns owing to the need for a combination of rigid 
and flexible regions. The potential for a given genomic 
sequence to have the appropriate physical properties is 
evaluated by two components of ELAN, the programs 
ISF and DNASCANNER. 

The specificity and sensitivity of the technique depends 
on the element being analyzed and its context. Elements 
have a highly non-uniform distribution for instance there 
are ~7000 copies of LI on the largest human chromosome 
I, but the X chromosome has over 10000 copies. ELAN 
successfully predicts over 90% of the insertion sites of Alu 
using ISF on the X chromosome (Supplementary Data) 
whereas on the Y chromosome the prediction rate reduces 
to 60%. Since the present computational tool identifies 
insertion sites mainly through the recognition of 'typical' 
patterns, it would appear that the insertion sites on the X 
chromosome bear more fidelity to the standard or ideal 
insertion site. The lower predictive accuracy of ISF on the 
Y chromosome reflects the fact that the insertion sites may 
themselves have evolved (57,58). Similarly, in analyzing 
different age classes of Alu, namely the Y, S and J, 
younger elements such as Alu Y show stronger signals 
compared to older or truncated elements since the prob- 
ability of mutation of both a given element as well its 
flanking sequences increases with evolutionary time. 

The sequence environment of the insertion site has fairly 
specific motifs that depend on the element in question. The 
local sequence between 10- and 20-bp upstream of the site 
is likely the most important (although not the deciding) 



feature that enables insertion. Even locally, there can be 
considerable flexibility in sequence: studies of LI- and 
Alu-insertion sites report both canonical (TT/AAAA) 
and non-canonical (TTAAGA, TTAGAA, TTGAAA, 
TTAAAG, CTAAAA, TCAAGA, AAAAAA) insertion 
motifs for LI and Alu (59-63). While the canonical motif, 
TT/AAAA occurs most frequently in the upstream region 
(see Figure 8, and Supplementary Data for the case of 
Chromosomes 22 and Y) the other motifs are also 
present in fair measure. Indeed the overwhelming majority 
of TTAAAA motifs that occur on the genome are not 
associated with the Alu insertion sites. At the same time, 
there is considerable evidence that the 10-20-bp upstream 
region plays a major role in the retro-transposition mech- 
anism, and the TTAAAA motif is indeed frequently 
present in this region (59). Non-canonical motifs, when 
linked to elements, are also found in the same upstream 
region where the various physicochemical properties we 
have studied (see 'Materials and Methods' section) show 
maximal variation. The TT/AAAA motif is present in 
13% of Alu insertion sites examples studied in Human 
chromosome 22 in 10-20-bp upstream of Alu insertion 
sites. Our results corroborate well with earlier work (59) 
where authors showed presence of TT/AAAA at ~15- 
16-bp upstream of 400 examples of Alu insertion sites. A 
recent study suggest longer pattern containing canonical 
and non-canonical motifs around Alu insertion sites (64). 

Features those present on larger scales and which influ- 
ence element insertion are not easily identified. Alus insert 
in the immediate vicinity of A-rich regions (of size 
~ 100 bp, Figure 3) although these regions are themselves 
within GC regions on even larger scales (of the order of 
megabites). Further, Alus are enriched in gene-rich 
regions while Ll-populate-intergenic regions. An add- 
itional problem that is only partially addressed here is 
whether the current distribution of MGEs reflects their 
insertion dynamics, or is a consequence of the retention 
potential of different sites. Thus other features, in addition 



1200 



1000 



800 



600 



400 



200 



Frequency 




i Frequency 




1-10 10--20 20-30 30-40 40-50 50-60 60-70 70-80 80-! 



-100 null 



Distance of 
Alu from 
TTAAAA 


Frequency 


% 


1-10 


22 


1.49 


10-20 


202 


13.76 


20-30 


65 


4.42 


30-40 


13 


0.88 


40-50 


14 


0.95 


50-60 


19 


1.29 


60-70 


6 


0.40 


70-80 


9 


0.61 


80-90 


14 


0.95 


90-100 


8 


0.54 


null 


1096 


74.65 


TOTAL 


1468 


100 



Figure 8. Most instances of the canonical TTAAAA motifs are unrelated to known Alu insertion sites. Shown here is a histogram of distances of 
TTAAAA to the nearest Alu insertion site, and as can be seen, >70% are >100-bp away from the nearest Alus (see http://nldsps.jnu.ac.in/elan.html 
for more details). 
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to possibly long-ranged correlations are likely to affect the 
distribution of MGEs. 

It is well-known that MGEs tend to invade genomes 
in pairs of LINEs/SINEs. Examples include Ll/Alu in 
human genome, Ll/Bl in mouse and EhLINEs/ 
EhSINEs in E. histolytica. Such pairs of elements share 
insertion site preferences in addition to sharing local 
sequence similarity. Alu and Bl elements belong to the 
same family of SINEs and the signals observed at mouse 
(Bl) and human (Alu) insertion sites are strikingly similar 
(Supplementary Figures S5 and S7). This raises the possi- 
bility that the tools developed here for non-LTRs could be 
applied to other MGEs as well and our preliminary studies 
have been encouraging. For instance, the Tyl element in 
Saccharomyces cerevisiae integrates preferentially 
upstream of genes that are transcribed by RNA polymer- 
ase III (65). When these regions were analyzed, a number 
of the signals used here for non-LTRs (the A and G rules, 
bending stiffness, DNA-denaturation energy) showed sig- 
nificant signals (see Supplementary Figures and data 
http://nldsps.jnu.ac.in/Tyl). The other insertion sites 
examined were those for P elements in Drosophila (66), 
Tn7 system (67) and Sleeping beauty (68), which also 
showed some of the same signals, although the specific 
insertion mechanisms in these cases can be quite different 
from the non LTRs. This may interpreted as though the 
general mechanism behind the transposase-based mobil- 
ization of DNA transposons is completely different from 
the reverse transcriptase-based mechanisms of 
retro-transposon mobilization but the biophysical 
properties of the genomic DNA surrounding known inser- 
tion sites are similar among all insertion sites. 

The present work is an ongoing effort to develop an 
integrated system for analysis and de novo detection of 
MGEs. The tools available here are generic, and in prin- 
ciple can be adapted for use in the detection of other 
features as well. All data generated in these studies, as 
well as from our ongoing analysis of MGEs in diverse 
genomes has been curated in the database InSiDe that 
incorporates information on the distributions, insertion 
sites and element sequences. This is available at http:// 
nldsps.jnu.ac. in/inside. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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